mappgene summary

Jeff Kimbrel

Thu Nov 4 16:30:24 2021

Background

Load and Process Files

The summary looks for files in the /Library/Frameworks/R.framework/Versions/4.1/Resources/library/MappgeneSummary/test_files/ directory, and will write the following files there. Creating the .Rds files is typically the slowest part of this script, so they are saved to speed up testing and creation of the HTML summary.

File Name                          File Type       Notes
_summary_lofreq.Rds                R data          The lofreq data (tibble)
_summary_ivar.Rds                  R data          The ivar data (tibble)
_summary_DF.Rds                    R data          The “final” data used for all analyses (tibble)
_summary_DF.txt                    tab-delimited   The “final” data used for all analyses
_lofreq_top_outbreak_results.txt   tab-delimited   The Outbreak API results from the top lofreq iSNVs
_ivar_top_outbreak_results.txt     tab-delimited   The Outbreak API results from the top ivar iSNVs
_amplicon_summary.txt              tab-delimited   Attempt at re-creating a specific report for Excel analysis
_amplicon_results.txt              tab-delimited   Attempt at re-creating a specific report for Excel analysis

Lofreq (iSNVs <1%)

This is the “iVar -> LoFreq -> snpEFF/snpSIFT” workflow, and these results will have the “lofreq” PIPELINE string.

The ALT_DP column has been added as as.integer(DP * AF).
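As a rough illustration of the ALT_DP calculation, here is a minimal Python sketch (the report itself is generated in R). The function name alt_depth and the example rows are hypothetical; the only behaviour taken from the report is that the product DP * AF is truncated to an integer, as R's as.integer() does.

```python
import math

def alt_depth(dp: int, af: float) -> int:
    """Approximate ALT read depth: DP * AF truncated toward zero,
    mirroring the behaviour of R's as.integer()."""
    return math.trunc(dp * af)

# Made-up example rows, not data from this report.
rows = [{"DP": 1000, "AF": 0.25}, {"DP": 1999, "AF": 0.5}]
for row in rows:
    row["ALT_DP"] = alt_depth(row["DP"], row["AF"])
```

Because the value is truncated rather than rounded, a product such as 999.5 becomes 999.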

The HGVS_P column was also split into REF_AA and ALT_AA, and any synonymous mutations were removed.
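The HGVS_P handling can be sketched in Python as follows. This assumes HGVS_P values use the three-letter protein notation (for example "p.Asp614Gly"), which is an assumption about the input rather than something stated in this report. The function names and example strings are hypothetical, and only simple substitutions are parsed.

```python
import re

# Matches simple substitutions such as "p.Asp614Gly" or "p.Gln27*".
# Other HGVS forms (deletions, insertions, frameshifts) are left unparsed.
HGVS_P_SUBSTITUTION = re.compile(r"^p\.([A-Za-z]{3}|\*)(\d+)([A-Za-z]{3}|\*)$")

def split_hgvs_p(hgvs_p):
    """Return (REF_AA, position, ALT_AA), or None if the string is not a
    simple substitution."""
    m = HGVS_P_SUBSTITUTION.match(hgvs_p)
    if m is None:
        return None
    ref_aa, pos, alt_aa = m.groups()
    return ref_aa, int(pos), alt_aa

def is_synonymous(hgvs_p):
    """A substitution is synonymous when REF_AA equals ALT_AA."""
    parsed = split_hgvs_p(hgvs_p)
    return parsed is not None and parsed[0] == parsed[2]

# Made-up example values, not data from this report.
variants = ["p.Asp614Gly", "p.Leu5Leu", "p.Gln27*"]
kept = [v for v in variants if not is_synonymous(v)]
```

Strings that do not parse are kept rather than dropped, so indels are not silently lost by this sketch.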

iVar (iSNVs > 1%)

This is the “iVar -> snpEFF/snpSIFT” workflow, and these results will have the “ivar” PIPELINE string.

iVar output was processed according to Jimmy’s pipeline:

  1. removing variant positions with an ALT_QUAL of 20 or below (these are almost entirely single nt insertions),
  2. removing positions that did not pass the Fisher test (PASS=FALSE),
  3. removing synonymous mutations (REF_AA=ALT_AA).
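The three filtering steps above can be sketched as a single predicate in Python. The column names come from this report. The strict "greater than 20" cutoff is an assumption based on the "ALT_QUAL > 20" wording in the Effects section, and the function name and example records are hypothetical.

```python
def passes_ivar_filters(variant, min_alt_qual=20):
    """Keep a variant only if it clears all three filters."""
    return (
        variant["ALT_QUAL"] > min_alt_qual          # step 1: quality (assumed strict cutoff)
        and variant["PASS"]                         # step 2: Fisher test passed
        and variant["REF_AA"] != variant["ALT_AA"]  # step 3: non-synonymous
    )

# Made-up example records, not data from this report.
ivar_rows = [
    {"ALT_QUAL": 35, "PASS": True, "REF_AA": "D", "ALT_AA": "G"},   # kept
    {"ALT_QUAL": 20, "PASS": True, "REF_AA": "D", "ALT_AA": "G"},   # fails step 1
    {"ALT_QUAL": 35, "PASS": False, "REF_AA": "D", "ALT_AA": "G"},  # fails step 2
    {"ALT_QUAL": 35, "PASS": True, "REF_AA": "L", "ALT_AA": "L"},   # fails step 3
]
filtered = [row for row in ivar_rows if passes_ivar_filters(row)]
```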

The ALT_DP column has been added as as.integer(DP * AF).

The HGVS_P column was also split into REF_AA and ALT_AA, and any synonymous mutations were removed.

Pipeline Summary and Merging

SAMPLEs: There are 3 total samples with data so far, 0 done with just lofreq, 0 done with just ivar, and 3 with both.

SNVs: There are 244 total SNVs identified so far that are below 1% in ‘lofreq’ in at least one sample, or above 1% in ‘ivar’ in at least one sample: 213 found with just lofreq, 26 found with just ivar, and 5 found with both.
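The three SNV counts partition the total (213 + 26 + 5 = 244). A minimal Python sketch of that bookkeeping, using set operations on (SNV, PIPELINE) pairs, is shown here. The function name and example identifiers are hypothetical, and this is not the report's own R code.

```python
def partition_by_pipeline(records):
    """Split SNV identifiers into lofreq-only, ivar-only, and shared sets.

    `records` is an iterable of (snv_id, pipeline) pairs.
    """
    lofreq = {snv for snv, pipeline in records if pipeline == "lofreq"}
    ivar = {snv for snv, pipeline in records if pipeline == "ivar"}
    return {
        "lofreq_only": lofreq - ivar,
        "ivar_only": ivar - lofreq,
        "both": lofreq & ivar,
    }

# Made-up identifiers, not data from this report.
records = [
    ("S.100.A.T", "lofreq"),
    ("S.200.C.G", "lofreq"),
    ("S.200.C.G", "ivar"),
    ("N.300.G.A", "ivar"),
]
groups = partition_by_pipeline(records)
total = sum(len(snvs) for snvs in groups.values())
```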

Effects

snpSIFT classifies the SNV effect into several categories. Here are the categories and their total frequencies in all samples. Only the “blue” categories are used in this analysis. Most of the “ivar” effect categories are removed with the ALT_QUAL > 20 filtering.

EFFECT                           SO           Description
missense_variant                 SO:0001583   A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.
stop_gained                      SO:0001587   A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened polypeptide.
conservative_inframe_deletion    SO:0001825   An inframe decrease in cds length that deletes one or more entire codons from the coding sequence but does not change any remaining codons.
disruptive_inframe_deletion      SO:0001826   An inframe decrease in cds length that deletes bases from the coding sequence starting within an existing codon.
conservative_inframe_insertion   SO:0001823   An inframe increase in cds length that inserts one or more codons into the coding sequence between existing codons.
disruptive_inframe_insertion     SO:0001824   An inframe increase in cds length that inserts one or more codons into the coding sequence within an existing codon.

ALT_AAs and EFFECT

SNV Sample Count per Pipeline

Each SNV was given a unique identifier made of the gene name, the chromosome position, and the REF and ALT nucleotides, all separated by periods ("."). So, S.23367.C.A is in the S gene at chromosome position 23,367, and results in the reference “C” being converted to an “A”.
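The identifier scheme can be illustrated with a short Python sketch that builds and parses such strings. The helper names (SNV, make_snv_id, parse_snv_id) are hypothetical and are not part of the MappgeneSummary package.

```python
from typing import NamedTuple

class SNV(NamedTuple):
    gene: str
    pos: int
    ref: str
    alt: str

def make_snv_id(gene, pos, ref, alt):
    """Join the four fields with periods, e.g. S.23367.C.A."""
    return f"{gene}.{pos}.{ref}.{alt}"

def parse_snv_id(snv_id):
    """Split from the right so a gene name containing a period stays intact."""
    gene, pos, ref, alt = snv_id.rsplit(".", 3)
    return SNV(gene, int(pos), ref, alt)

example = parse_snv_id("S.23367.C.A")
```

Splitting from the right is a defensive choice in this sketch; it only matters if a gene name itself contains a period.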

Below is the Sample count of all unique SNVs in both the lofreq and ivar pipelines.

Visualization

These are some examples of the different plots that can be generated. They are typically static images, although some interactivity could potentially be added later.

Heatmap

Only the S protein is shown.

NMDS

Between Pipelines

The stress of the above plot is 0.0054846 (values above roughly 0.10 to 0.15, that is 10-15%, are generally considered “high stress”).

Between Samples

If an SNV was found in a sample by both ivar and lofreq, the frequency is averaged.
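The averaging rule can be sketched in Python as a group-and-mean over (sample, SNV) pairs. The function name, the record layout, and the frequency values are made up for illustration; the report's own implementation is in R and may differ.

```python
from collections import defaultdict
from statistics import mean

def merge_frequencies(records):
    """Average the allele frequency over pipelines for each (sample, SNV).

    `records` is an iterable of (sample, snv_id, pipeline, af) tuples. A pair
    reported by only one pipeline keeps that pipeline's frequency unchanged.
    """
    grouped = defaultdict(list)
    for sample, snv_id, _pipeline, af in records:
        grouped[(sample, snv_id)].append(af)
    return {key: mean(values) for key, values in grouped.items()}

# Made-up values chosen for exact arithmetic, not data from this report.
records = [
    ("sampleA", "S.23367.C.A", "lofreq", 0.25),
    ("sampleA", "S.23367.C.A", "ivar", 0.75),
    ("sampleB", "S.23367.C.A", "ivar", 0.5),
]
merged = merge_frequencies(records)
```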

The stress of the above plot is 0 (values above roughly 0.10 to 0.15, that is 10-15%, are generally considered “high stress”).

Summary Tables

Amplicon Summary

This table is too big to display here. It is saved as a tab-delimited file here: ./_amplicon_summary.txt.

Amplicon Result

This table is too big to display here. It is saved as a tab-delimited file here: ./_amplicon_result.txt.

outbreak.info API

Note, only mutations with the ‘missense_variant’ EFFECT are searchable in the API. All others are not included below.
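A minimal Python sketch of how the searchable subset might be selected and labelled in the GENE:REFposALT style used below (for example S:G566V). It assumes REF_AA and ALT_AA hold one-letter amino acid codes; the POS_AA field name and the example records are hypothetical. No outbreak.info API call is shown.

```python
def outbreak_labels(variants):
    """Build GENE:REFposALT labels for missense variants only."""
    return [
        f'{v["GENE"]}:{v["REF_AA"]}{v["POS_AA"]}{v["ALT_AA"]}'
        for v in variants
        if v["EFFECT"] == "missense_variant"
    ]

# Made-up records; POS_AA is a hypothetical column name.
variants = [
    {"GENE": "S", "REF_AA": "G", "POS_AA": 566, "ALT_AA": "V",
     "EFFECT": "missense_variant"},
    {"GENE": "S", "REF_AA": "Q", "POS_AA": 27, "ALT_AA": "*",
     "EFFECT": "stop_gained"},
]
labels = outbreak_labels(variants)
```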

S Gene

S:G566V not found in CA
S:N658H not found in CA

Session Info

name value
version R version 4.1.0 (2021-05-18)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2021-11-04